perf(xid_correlation): ratchet allocs/event 12.4 -> 1.88 (hit NORTHSTAR, #417)#448
Conversation
Pack all five variable-length prose strings per verdict (EvictedPod, Headline, Remediation, xid-evidence UID, pod-evidence description) into a single composite byte buffer, then carve immutable string headers out of that buffer via unsafe.String aliasing. Combined with the 2-element EvidenceTrail backing array, the per-verdict floor is 2 allocations regardless of input scale. Detector hits NORTHSTAR. Recipe mirrors PR #438 (nccl_hang): pre-size join maps and verdicts slice; replace fmt.Sprintf with strconv.AppendInt-based builders; fold per-verdict integer formatting onto stack scratch arrays. Adds a composite-buffer + unsafe.String pass that #438 did not need (nccl_hang emits ~64 verdicts vs xid_correlation's 896 per bench pass -- per-verdict cost dominates here). Bench (bench/detectors/, count=5, M1 Max): before: 12699 allocs/op, 907 KiB/op, 12.40 allocs/event after: 1928 allocs/op, 597 KiB/op, 1.88 allocs/event delta: -85% allocs, -34% bytes; hits NORTHSTAR (<=2) Hotspots eliminated (per memprofile alloc_objects): - xidCorrelationHeadline fmt.Sprintf -> composite-buffer carve - xidCorrelationRemediation fmt.Sprintf -> composite-buffer carve - xidEvidenceUID fmt.Sprintf -> composite-buffer carve - pod-event Description fmt.Sprintf -> composite-buffer carve - displayPodName concat folded into composite buffer - indexXidsByNode map pre-sized to len(xids) - verdicts slice grow pre-sized to len(events) Verdict semantics byte-identical: golden replay (module/pkg/replay/xid_correlation/canonical/golden.json) passes unchanged. The carved strings alias an immutable backing buffer; JSON serialization compares content equality, not pointer identity. Ceiling lowered from 13 -> 2 in scripts/bench-registry.sh; baseline refreshed in bench/detectors/baselines.json. Closes #417 Signed-off-by: Tri Lam <tree@lumalabs.ai>
Review: xid_correlation allocs 12.4 → 1.88 (NORTHSTAR hit)B/A/A+ CriteriaDetector correctness: Verdict output byte-identical (golden replay confirmed via no golden.json changes). Prose strings match original semantics across all 5 fields (EvictedPod, Headline, Remediation, xid-UID, pod-description). Go perf discipline: unsafe.String aliasing is sound by construction (buffer finalized before carve, carved strings don't outlive heap-allocated backing, no further appends post-carve). Stack-allocated scratch arrays don't escape (copied into composite before next use). Simplicity: Approach mirrors established nccl_hang pattern (PR #438). Dead functions inlined (xidCorrelationHeadline, xidCorrelationRemediation). Helpers justifiable: intToScratch (3 uses, stack-safety documentation), stringSlice (5 uses, contract clarity), writeInt (3 uses, builder encapsulation). Unsafe.String Safety AuditBuffer finalization (CRITICAL): Lines 158-209 fill composite buffer via appends; lines 212-216 carve unsafe.String aliases; no appends after line 209. Safe. Length pre-computation: podDisplayLength (lines 248-259) mirrors displayPodName (pod_evicted.go:189-196) exactly across all three branches. Capacity pre-sizing correct. Carved string lifetimes: composite is heap-allocated in buildXidCorrelationVerdict; carved strings escape in returned XidCorrelationVerdict struct. Go runtime keeps heap buffer alive as long as strings reference it (standard semantics). Self-reference appends safe: Lines 186, 199, 206 read already-appended ranges (composite[podStart:podEnd]) within the same growing buffer. No aliasing to external buffers. Off-by-one risk: None. lo/hi computed from len() calls within buildXidCorrelationVerdict; indices are local state. Data race risk: None. Evaluate() is single-threaded; inputs are read-only snapshots; no goroutine spawning. Go version: go.mod specifies 1.26.3; unsafe.String available since 1.20. ✓ Simplification Sweep: CleanDeleted two standalone functions (xidCorrelationHeadline, xidCorrelationRemediation) inlined into composite-buffer build. Added helpers justified at ≥3 uses except writeInt (3 uses, at threshold but improves builder readability + encapsulates stack pattern). No noise comments; all doc strings serve unsafe.String contract or alloc-discipline purpose. Benchmark GateRatchet 13 → 2 allocs/event:
VerdictA+ — Unsafe.String aliasing is sound. Golden replay byte-identical. Follows established perf recipe (composite buffer + strconv builders + pre-sized structures). Benchmark gate sensible. Code quality high. |
Address reviewer A- trim items on #456: 1. displayPodName: drop optimization-impact comment (function inlined in pod_evicted; kept exported for xid_correlation use). 2. Evaluate: collapse 9-bullet allocation-changelog into one-line NORTHSTAR pointer + per-file note. 3. Consolidate fmt->strconv narrative into one top-of-Evaluate note; drop the per-site repetition across five render helpers. 4. containsFold: replace 'equivalent to' note with explicit ASCII-only safety guard (anchors + kubelet eviction messages are English). No behavior change: allocs/op = 1943 unchanged, golden replay byte-identical, bench-allocs-check still PASS. Verified finding 5: scripts/bench-registry.sh xid_correlation ceiling already 2 (post-#448), no change required. Signed-off-by: Tri Lam <tree@lumalabs.ai>
…434) (#456) ## Summary Closes #434. Profile-driven optimization of `PodEvictedDetector.Evaluate` drops per-event allocations from **15.27 → 1.90**, clearing the **NORTHSTAR (2)** floor. Bench ceiling lowered 16 → 2 in `scripts/bench-registry.sh::allocs_gate`; `bench/detectors/baselines.json` updated. Same recipe as #438 (nccl_hang) and #448 (xid_correlation). ## Before / after `go test -bench=BenchmarkPodEvictedDetector -benchmem -count=10 -benchtime=500ms ./bench/detectors/` (Apple M1 Max): | metric | before | after | delta | |---|---|---|---| | allocs/op | 15635 | 1943 | **-87.6%** | | B/op | 970020 | 473400 | -51.2% | | allocs/event | 15.27 | **1.90** | hit NORTHSTAR | Stability: 10 consecutive bench runs all measured 1943 allocs/op (CV = 0%). ## Root cause (named) Per-evicted-pod the previous detector allocated ~19 objects. Profile (`-memprofile`, `-alloc_objects`) attributed them: 1. **`fmt.Sprintf` ×5 in `buildVerdict` + helpers** — 60% of pre-fix allocs. Sprintf boxes each arg into an `any` slice + scans the format string into a fresh internal buffer; both escape. Replaced with `strconv.AppendInt` + manual byte builders. 2. **`strings.ToLower(note)` in `pressureFromNote`** — 1 alloc per evicted pod (the lowercased copy). Replaced with a zero-alloc `containsFold` ASCII-fold scan (all anchors are already lowercase). 3. **`time.Format` in `formatTimestamp`** — Sprintf-style alloc per headline. Replaced with `t.AppendFormat(buf, time.RFC3339)` writing into the shared scratch buffer. 4. **Unsized `condIdx` map + per-bucket `append` growth** — `make(map[K]V)` defaults to 8 buckets; `append(nil, r)` grows each bucket. Pre-sized to `len(recs)`. 5. **Per-verdict `make([]EvidenceRef, 1, 2)`** — 1 alloc per verdict for the trail slice header. Replaced with ONE contiguous `trailBacking := make([]EvidenceRef, 2*len(events))` sliced cap=2 per verdict. 6. **Escaping `make([]byte, dynamic-cap)` per builder call** — escape analysis can't prove the dynamic capacity fits on stack, so each builder allocated TWO objects (buf + the `string(buf)` cast). Replaced with a per-Evaluate `scratch *[]byte` reused across every builder (each resets to `(*scratch)[:0]` and appends), collapsing to ONE alloc per call (the irreducible string cast). 7. **Redundant per-condition re-render** — 820 evictions joining onto 64 conditions re-rendered the same UID + description + remediation strings on every join (2460 redundant allocs). Pre-rendered once at index time into a new `indexedNodeCond` struct. 8. **Partial-path remediation re-render** — partial-path evictions (no joined condition) re-rendered the `"On node X: <prose>"` remediation per pod. Added a `(node, pressure) → string` cache so burst evictions on the same node share one alloc. 9. **`sort.SliceStable` reflection** — both `indexNodeConds`'s per-bucket sort and the outer verdicts sort used reflection-based `sort.SliceStable` (allocates a swapper per call). Replaced with `slices.SortStableFunc` (generics, no reflection); buckets of length ≤1 skip the sort call entirely. ## Top profile entries (memprofile -alloc_objects) **Before:** ``` 60% buildVerdict (5 fmt.Sprintf calls per eviction) 22% fmt.Sprintf 6% nodeConditionDescription 4% annotateRemediationWithNode 3% time.Time.Format 4% strings.ToLower ``` **After:** ``` 42% podEventDescription (1 alloc/call — irreducible string buf cast) 36% renderHeadline (1 alloc/call — irreducible string buf cast) 7% annotateRemediationWithNode (partial-path cache misses only) 5% nodeConditionDescription (index-time pre-render, 64 calls total) 3% nodeConditionUID (index-time pre-render, 64 calls total) ``` The two per-verdict string allocations (headline + pod_event description) are now the irreducible floor: each is a distinct verdict field that downstream `json.Marshal` requires as separate strings. ## Verdict semantics — unchanged Byte-identical headlines, remediation, UIDs, descriptions, JSON envelope. Confirmed via: - `module/pkg/replay/pod_evicted/canonical/golden.json` round-trips unchanged. - All `TestPodEvicted*` unit tests pass (negative fixtures, deterministic order, partial path, out-of-window, future transition excluded, empty node message, remediation pins node name, schema conformance + drift rejection). - `TestPodEvictedVerdict_SchemaConformance` (canonical full + partial) passes. ## Gate state | gate | result | |---|---| | `scripts/bench-allocs-check.sh` (hard absolute ceiling) | PASS at new ceiling=2 | | `scripts/bench-check-detectors.sh` (10% delta soft gate) | PASS (no detector regressed) | | `go test -race ./module/pkg/patterns/ ./module/pkg/replay/ ./module/processor/...` | PASS | | `go vet ./module/pkg/patterns/` | clean | | pre-commit (golangci-lint, vet, mod verify, attribute-namespace-check) | PASS | ```release-notes pod_evicted detector now allocates 1.90 allocations per event (down from 15.27), hitting the v0.3.0 NORTHSTAR floor of <=2 allocs/event. Joins #418 (nccl_hang) and #417 (xid_correlation) — three of the six per-detector tracking issues at or below NORTHSTAR. ``` Closes #434 Refs #302 ## Test plan - [x] `go test -race -count=1 ./module/pkg/patterns/ ./module/pkg/replay/ ./module/processor/...` PASS - [x] `go vet ./module/pkg/patterns/` clean - [x] `bash scripts/bench-allocs-check.sh` PASS (pod_evicted 1.90/ev <= ceiling 2; all 6 detectors at-or-below ceiling) - [x] `bash scripts/bench-check-detectors.sh` PASS (no detector regressed >10%) - [x] `go test -bench=BenchmarkPodEvictedDetector -count=10 -benchtime=500ms` stable at 1943 allocs/op (CV=0%) - [x] Golden replay (`module/pkg/replay/pod_evicted/canonical/golden.json`) round-trips unchanged - [x] Rebased onto origin/main (resolved scripts/bench-registry.sh conflict with #448's xid_correlation ratchet — both ratchets shipped together in this branch's view) --------- Signed-off-by: Tri Lam <tree@lumalabs.ai>
## Summary `scripts/bench-registry.sh::bench_entries` covered only `BenchmarkPodEvictedDetector` and the pyspy parser benches, so the **% delta regression gate** (`scripts/bench-check.sh` via `bench-check-all.sh`) missed five of the six detectors. The absolute allocs/event ceiling (`allocs_gate`) caught a detector that drifted past its hard ceiling, but a detector that drifted +50% while staying under the ceiling would ship silently. This PR closes the gap by registering all six detectors against both gates. ## Root cause The registry was authored when `pod_evicted` was the only detector with a bench. The five later detectors (added in #417/#418/#434/#438/#448/#456) were plumbed into `allocs_gate` (absolute ceiling) but the parallel % delta gate was never extended — a historical artifact, not an architectural defect. Fix: extend `bench_entries` to cover the missing five. ## What changed - `scripts/bench-registry.sh`: added one entry covering five detectors against `./bench/detectors/`, single-regex / single `go test` invocation (~5s incremental vs ~30s if split into 5 entries). - `bench/detectors/testdata/bench-baseline.txt`: generated baseline (count=10 × 500ms, Apple M1 Max). Hardware-invariant signals (B/op + allocs/op) pin the gate; sec/op stays advisory. - Existing ceilings in `allocs_gate` unchanged. Existing baselines unchanged. ## Mutation verification Bumped `PCIeAERDetector` allocs baseline down by ~18% (524 → 430): ``` PCIeAERDetector-10 430.0 ± 0% 524.0 ± 0% +21.86% (p=0.000 n=10) REGRESSION: the following benchmarks exceeded the 10% threshold vs baseline: PCIeAERDetector-10 +21.86% (p=0.000 n=10) ``` `scripts/bench-check-all.sh` exit 1. Revert → exit 0. Gate fires on regression, stays clean on baseline. ## Coverage gap context Cross-link: #302 (allocs/event rollup), #417 (xid_correlation NORTHSTAR), #418 (nccl_hang NORTHSTAR), #434 (pod_evicted NORTHSTAR), #438/#448/#456 (sibling perf PRs). Every detector that hit NORTHSTAR under the absolute ceiling now also has a relative-regression gate. ## Test plan - [x] `make bench-check` exit 0 (covers all 6 detectors now) - [x] `make bench-allocs-check` exit 0 - [x] `make bench-detectors-check` exit 0 (soft-gate, unchanged) - [x] Mutation: lower baseline → exit 1; revert → exit 0 - [x] Lint + vet (pre-commit hook): 0 issues ```release-notes ci(bench): % delta regression gate now covers all six detectors (#302). xid_correlation, hbm_ecc, nccl_hang, thermal_throttle, and pcie_aer were previously gated only by the absolute allocs/event ceiling — they now also fail builds on >10% allocs/op or B/op drift vs the committed baseline. ``` Signed-off-by: Tri Lam <tree@lumalabs.ai>
Summary
Lowers
BenchmarkXidCorrelationDetectorallocs/event from 12.40 → 1.88 (a 6.6x reduction), hitting the NORTHSTAR ≤2 target. Ratchets the absolute-ceiling gate from 13 → 2 and refreshes the baseline.Follows the recipe established by PR #438 (nccl_hang: 3.99 → 1.70) and extends it with a composite-buffer +
unsafe.Stringaliasing pass, which is load-bearing here because xid_correlation emits one verdict per evicted pod (896 verdicts / 1024-event bench window) vs nccl_hang's ~64 verdicts — per-verdict allocation cost dominates allocs/event.Root cause
Per-verdict the detector built five variable-length prose strings (
Headline,Remediation, xid-evidenceUID, pod-evidenceDescription, plusEvictedPodfromdisplayPodName'sns + "/" + nameconcat). Four of those usedfmt.Sprintf, each costing one interface-boxing allocation per integer/string argument plus one result-string allocation. At 896 verdicts × ~11 allocs/verdict + a growingverdictsslice + a default-sizedindexXidsByNodemap, the measured floor was 12699 allocs/op.Approach
buildXidCorrelationVerdictnow pre-computes the rendered length of every variable-length string, allocates ONE[]byteof exactly that total size, appends each field into a known offset range, then carves immutable string headers out viaunsafe.String(&buf[lo], hi-lo). The buffer is finalized before any string is carved, so the aliasing is sound by construction. Per-verdict allocation floor drops from 6 to 2 (the composite buffer + the 2-elementEvidenceTrailbacking array).indexXidsByNodemap is sized tolen(xids); theverdictsslice is sized tolen(events).[20]bytescratch arrays +strconv.AppendInt(nofmt.Sprintf, nostrconv.Itoa's allocation).xidEvidenceUIDandxidEvidenceDescription(shared with hbm_ecc.go).The
unsafe.Stringaliasing trick is documented inline; the invariant ("buffer is immutable post-carve") is enforced by construction inbuildXidCorrelationVerdict(no furtherappendafter the carve block).Before / after
Bench:
go test -bench=BenchmarkXidCorrelationDetector -benchmem -count=5 -benchtime=500ms ./bench/detectors/on M1 Max.Profile entries fixed (per
pprof -sample_index=alloc_objects):xidCorrelationHeadlinefmt.Sprintf→ composite-buffer carvexidCorrelationRemediationfmt.Sprintf→ composite-buffer carvexidEvidenceUIDfmt.Sprintf→ composite-buffer carveDescriptionfmt.Sprintf→ composite-buffer carvedisplayPodNamens + "/" + nameconcat → folded into composite bufferindexXidsByNodemap → pre-sized tolen(xids)verdictsslice grow chain → pre-sized tolen(events)Residual per-verdict allocations (the two unavoidable ones at NORTHSTAR floor):
EvidenceTrail2-element backing array (1 alloc, escapes on the returned verdict).Semantics
Golden-replay byte-identity preserved:
module/pkg/replay/xid_correlation/canonical/golden.jsonpasses unchanged. Theunsafe.String-carved strings alias an immutable buffer; JSON serialization compares content equality, not pointer identity.Files changed
module/pkg/patterns/xid_correlation.go— composite-buffer rewrite ofbuildXidCorrelationVerdict; pre-sized join maps; dead Sprintf-era renderers removed.bench/detectors/baselines.json— refreshed BenchmarkXidCorrelationDetector entry to the new floor.scripts/bench-registry.sh— ceiling 13 → 2; comment updated to record the ratchet and NORTHSTAR-hit.Test plan
go test -race ./pkg/patterns/... ./pkg/replay/... ./processor/... -count=1go test -bench=BenchmarkXidCorrelationDetector -benchmem -count=5 ./bench/detectors/— stable at 1928 allocs/op./scripts/bench-allocs-check.sh— every detector at-or-below ceiling./scripts/bench-check-detectors.sh— within 10% of baselinego vet ./...,golangci-lint run ./...— clean (pre-commit hook)module/pkg/replay/xid_correlation/canonical/golden.json— byte-identical replaySibling work
Closes #417